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Agronomique 

The aim of this paper is to provide a new method for the detec- 
tion of either favored or avoided distances between genomic events 
along DNA sequences. These events are modeled by a Hawkes' pro- 
cess. The biological problem is actually complex enough to need a 
non asymptotic penalized model selection approach. We provide a 
theoretical penalty that satisfies an oracle inequality even for quite 
complex families of models. The consecutive theoretical estimator is 
shown to be adaptive minimax for holderian functions with regularity 
in (1/2, 1]: those aspects have not yet been studied for the Hawkes' 
process. Moreover we introduce an efficient strategy, named Islands, 
which is not classically used in model selection, but that happens 
to be particularly relevant to the biological question we want to an- 
swer. Since a multiplicative constant in the theoretical penalty is not 
computable in practice, we provide extensive simulations to find a 
data-driven calibration of this constant. The results obtained on real 
genomic data are coherent with biological knowledge and eventually 
refine them. 

1. Introduction. Modeling the arrival times of a particular event on the real line is a 
common problem in time series theory. In this paper we deal with a very similar but fewly 
addressed problem: modeling the process of the occurrences of a particular event along a 
discrete sequence, namely a DNA sequence. Such events could be for instance any given DNA 
patterns, any genes or any other biological signals occurring along genomes. A huge literature 
exists on the statistical properties of pattern occurrences along random sequences [18] but 
our current approach is different. It consists in directly modeling the point process of the 
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occurrences of any kind of events and it is not restricted to pattern occurrences. Our aim is 
to characterize the dependence, if any, between the event occurrences by pointing out either 
favored or avoided distances between them, those distances being significantly larger than the 
classical memory used in the quite popular Markov chain model for instance. At this scale, 
it is more interesting to use a continuous framework and see occurrences as points. A very 
interesting model for this purpose is the Hawkes' process [12]. 

In the most basic self-exciting model, the Hawkes' process (Nt)teM. is defined by its intensity, 
which satisfies 

(1.1) \(t) = u+ I h(t-u)dN u , 

J — oo 

where v is a positive parameter, h a nonnegative function with support on R + and J h < 1 
and where dN u is the point measure associated to the process. The interested reader shall 
find in Daley and Vere-Jones' book [9] the main definitions, constructions and models related 
to point processes in general and Hawkes' processes in particular (see for instance Examples 
6.3(c) and 7.2(b) therein). 

The intensity A(i) represents the probability to have an occurrence at position t given all the 
past. In this sense, (1.1) basically means that there is a constant rate v to have a spontaneous 
occurrence at t but that also all the previous occurrences influence the apparition of an 
occurrence at t. For instance an occurrence at u increases the intensity by h(t — u). If the 
distance d = t — u is favored, it means that h(d) is really large: having an occurrence at u 
significantly increases the chance of having an occurrence at t. The intensity given by (1.1) 
is the most basic case, but variations of it enable us to model self inhibition, which happens 
when one allows h to take negative values (see Section 2.4) and, in the most general case, 
to model interaction with another type of event. The drawback is that, by definition, the 
Hawkes process is defined on an ordered real line (there is a past, a present and a future). 
But a strand of DNA itself has a direction, fact that makes our approach quite sensible. 

The Hawkes' model has been widely used to model the occurrences of earthquake [23]. In 
this set-up and even for more general counting processes, the statistical inference usually deals 
with maximum likelihood estimation ([16], [17]). This approach has been applied to genome 
analysis: in a previous work [12], Gusto and Schbath's method, named FADO, uses maximum 
likelihood estimates of the coefficients of h on a Spline basis coupled with an AIC criterion 
to select the set of equally spaced knots. 

On one hand, the FADO procedure is quite effective, it can manage interactions between 
two types of events and self excitation or inhibition, ie it works in the most general Hawkes' 
process framework and produces smooth estimates. However, there are several drawbacks. 
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From a theoretical point of view, AIC criterion is proved to select the right set of knots if 
first, there exists a true set of knots, and then if the family of possible knots is held fixed 
whereas the length of the observed sequence of DNA tends to infinity. Moreover from a 
practical point of view, the criterion seems to behave very poorly when a lot of possible sets 
of knots with the same cardinality are in competition [11]. FADO has been implemented with 
equally spaced knots for this reason. Finally it heavily depends on an extra knowledge of the 
support of the function h. In practice, we have to input the maximal size of the support, say 
10000 bases, in the FADO procedure. Consequently the FADO estimate is a Spline function 
based on knots that are equally spaced on [0,10000]. If this maximal size is too large, the 
estimate of h will probably be small with some fluctuations but not null until the end of the 
interval, whereas it should be null before (see Figure 12 in Section 5). 

On the other hand, our feeling is that if interaction exists, say around the distance d = 500 
bases, the function h to estimate should be really large around d = 500 and if there is no 
biological reason for any other interaction, then h should be null anywhere else. 

One way to solve this problem of estimation is to use model selection but in its non 
asymptotic version. Ideally, if the work of Birge and Massart in [5] was not restricted to the 
Gaussian case but if it also provides results for the Hawkes' model then it should enable us to 
find a way of selecting an irregular set of knots with complexity that may grow if the length 
of the observed sequence becomes larger. The question of the knowledge of the support never 
appears in Birge and Massart's work because there is not such a question in a Gaussian model, 
but one could imagine that their way of selecting sparse models should enable us to select a 
sparse support too. 

However we are not in an ideal world where white noise model and Hawkes' model are 
equivalent (even heuristically), so there is no way to guess the right way of penalizing in 
our situation. So the purpose of this article is to provide a first attempt at constructing a 
penalized model selection in a non asymptotic way for the Hawkes' model. This paper consists 
in both practical methods for estimating h that lie on theoretical evidences and also in new 
theoretical results such as oracle inequalities or adaptivity in the minimax sense. Note that up 
to our knowledge, the minimax aspects of the Hawkes' model have not yet been considered. 

Accordingly we restrict ourselves to a simpler case than the FADO procedure. First we focus 
on the self-exciting model (ie the one given by (1.1), where h is assumed to be nonnegative) , 
but we would at least like that the final estimator remains computable in case of self-inhibition. 
Then we do not use maximum likelihood estimators since they are not easily handle by 
model selection procedures, at least from a theoretical point of view. So we provide in this 
paper theoretical results for penalized projection estimators (i.e. least square estimators) 
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and not for penalized maximum likelihood estimators (see Chapter 7 of [15] for a complete 
comparison of both contrasts in the density setting from a model selection point of view). 
Finally, for technical reasons, we only deal with piecewise constant estimators. Once all those 
restrictions are done, the gap between the theoretical procedure and the practical procedure 
is consequently reduced to a practical calibration problem of the multiplicative constants. 

Since the Hawkes' processes are quite popular for modeling earthquakes, financial or eco- 
nomical data, we try to keep a general formalism in most of the sequel (except in the biological 
applications part). Consequently our method could be applied to many other type of data. 

In Section 2, we define the notations and the different families of models. Section 3 states 
first a non asymptotic result for the projection estimators, since up to our knowledge, these 
estimators were not yet studied. Then Section 3 gives a theoretical penalty that enables us 
to select a good estimator in a family of projection estimators. Indeed we prove that our 
penalized projection estimator satisfies an oracle inequality, hence proving by that result 
that our estimator is as good as the best projection estimator in the family up to some 
multiplicative term. However the multiplicative constant in the theoretical penalty is not 
computable in practice. As a consequence Section 4 provides simulations which validate a 
calibration method that seems to work well from a practical point of view. Then in Section 5 
we apply this method to DNA data. The results match biological evidences and refine them. 
Section 6 details the adaptive and minimax properties of our estimators. Section 7 is dedicated 
to more technical results that are at the origin of the ones stated in Section 3. Proofs can be 
found in Section 8. 

2. Framework. Let (Nt)t be a stationnary Hawkes' process on the real line satisfying 
(1.1). We assume that h has a bounded support included in (0, A] where A is a known positive 
real number and that 



satisfies p < 1 . This condition guarantees the existence of a stationnary version of the process 
(see [13]). Let us remark that, for the DNA applications we have in mind, A is quite known 
because it corresponds to a maximal distance from which it is no longer reasonable to consider 
a linear interaction between two genomic locations. If there may exist some interaction at 
longer distances, then it should certainly imply the 3D structure of DNA. 

We observe the stationnary Hawkes' process (Nt)t on an interval [— A, T], where T is a 
positive real number. Typically T should be significantly larger than A. Using this observation, 
we want to estimate 





(2.2) 



s = (u, h) 
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assumed to be in 



(2.3) L 2 = |/ = (/i, 5 ) : g with support in {Q,A}, ||/|| 2 = ^ + J g 2 (x)dx < +oo| . 

The introduction of this Hilbert space is related to the fact that we want to use least square 
estimators. 

With these constraints on h, we can note that (1.1) is equivalent to 

(2.4) \(t) = v+ f h{t-u)dN u . 

Jt-A 

Now we can introduce intensity candidates: for all / = (/i,<?) in L 2 , we define 

(2.5) V f (t) :=fi + [ g(t-u)dN u . 

Jt-A 

In particular, note that ^ s (t) = X(t). A good intensity candidate should be a ^/(-) that is 
close to ^ s (.). The least-square contrast is consequently defined for all / in L 2 by 



(2.6) 7 t(/) := ~ £ #f(t)dN t + ± £ V f (t) 2 dt. 

As we will see in Lemma 3, this really defines a contrast, in the statistical sense. Indeed, 
taking the compensator of the previous formula leads to 

~ £ * f (m s (t)dt +±£ * f (t) 2 dt. 

Let us consider the last integral in the previous equation: 

(2.7) D 2 {f):=±£* f {tfdt. 
Lemma 2 proves that Dj,{.) defines a quadratic form on L 2 such that 



(2-8) \f\ D := y/E(DUf)) 

is a quadratic norm on L 2 , equivalent to ||/|| (see (2.3)). In this sense, we can see 7r(/) as an 
empirical version of ||/ — sj'jj — Wsj'jj, which is quite classical for a least-square contrast (see 
the density set-up for instance in [15]). 
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2.1. Projection estimator. Let m be a set of disjoint intervals of (0, A]. In the sequel m 
is called a model and |m| denotes the number of intervals in m. One can think of m as a 
partition of (0, A] but there are other interesting cases as we will see later. Let S m be the 
vectorial space of L 2 defined by 

(2.9) S m = \f = (»,g) G L 2 such that g = ^ aj— ^= with (aj) Jem G M m 1 , 

where ^(1) = J tjdt. We say that g in the above equation is constructed on the model m. 
Conversely, if g is a piecewise constant function, remark that we can define a resulting model 
m by the set of intervals where g is constant but non zero and a resulting partition by the set 
of intervals where g is constant. The projection estimator, s m , is the least square estimator 
of s defined by 

(2-10) s m ■= argmin /e5m 7 T (/). 

Of course the estimator s m heavily depends on the choice of the model m. That is the main 
reason for trying to select it in a data driven way. Model selection intuition usually relies on a 
bias- variance decomposition of the risk of s m . So let us define s m as the orthogonal projection 
for ||.|| of s on S m . Then s m is a "good" estimate of s m , since 7t(/) is an approximation of 
II / ~~ s Id — \\ s \\d- We cannot prove that it is an unbiased estimate, but the intuition applies. 
So the bias can be more or less identified as ||s — s m \\ 2 . This is the approximation error of the 
model m with respect to s. As we will see in Proposition 1 and the consecutive comments, 
one can actually prove that 



E(\\S - SmV) - C T 



II 2 



771 



8 mj + j, 



where Ct is a positive quantity that slowly varies with T. So the variance or stochastic error 
may be identified as \m\/T. We recover a bias-variance decomposition where the bias decreases 
and the variance increases. Finding a model m in a data driven way that almost minimizes 
the previous equation is the main goal of model selection. However there is no precise shape 
for the quantity Ct- We consequently use the most general form of penalization in the sequel. 

2.2. Penalized projection estimator. Let M.t be a family of sets of disjoint intervals of 
(0, A] (i.e. a family of possible models). We denote by #{A^t} the total number of models. 
We define the penalty (or penalty function) by pen : TWy — > M + and we select a model by 
minimizing the following criterion: 

(2.11) rh := arguiin^^ [jT{s m ) + pen(m)] . 
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Then the penalized projection estimator is defined by 



(2.12) 



s = (z>, h) 



The main problem is now to find a function pen : — > K + that guarantees that 



(2.13) 



s — s\\ 2 < C inf lis — s. 



rn 



2 



and this either with high probability or in expectation, up to some small residual term and 
up to some multiplicative term C that could slightly increase with T. The previous equation 
(2.13) is an oracle inequality. If this oracle inequality holds, this will mean that we can select a 
model m, and consequently a projection estimator s = s^, that is almost as good as the best 
estimator in the family of the s m 's - whereas this best estimator cannot be guessed without 
knowing s. Of course this would tell us nothing if the projection estimators themselves, i.e. 
the s m 's, are not sensible. The next section precisely states the properties of the projection 
estimator and the oracle inequality satisfied by the penalized projection estimator. To conclude 
Section 2 we precise the different families of models we would like to use and we precisely 
explain what self-inhibition means in our model. 

2.3. Strategies. A strategy refers to the choice of the family of models A4t- In the sequel, a 
partition T of (0, A] should be understood as a set of disjoint intervals of (0, A] such that their 
union is the whole interval (0, A]. A regular partition is such that all its intervals have the same 
length. We say that a model m is written on T if all the extremities of the intervals in m are 
also extremities of intervals in T. For instance if T = {(0, 0.25], (0.25, 0.5], (0.50.75], (0.75, 1]} 
then {(0, 0.25], (0.25, 1]} or {(0, 0.25], (0.75, 1]} are models written on T. Now let us give some 
examples of families Mj*. Let J and N be two positive integers. 

Nested Strategy Take T a dyadic regular partition (i.e. such that |T| = 2 J ). Then take M. T 
as the set of all dyadic regular partitions of (0, A] that can be written on T, including 
the void set. In particular, note that #{M.t} = J + 2. We say that this strategy is 
nested since for any pair of partitions in this family, one of them is always written on 
the other one. 

Regular strategy Another natural strategy is to look at all the regular partitions of (0, A] 
until some finest partition of cardinal N. That is to say that one has exactly one model 
with cardinality k for each k in {0, N}. Here #{.Mt} = N + 1. 

Irregular strategy Assume now that we know that h is piecewise constant on (0, A] but 
that we do not know where the cuts of the resulting partition are. We can consider 
r a regular partition such that |T| = N and then consider Mt the set of all possible 
partitions written on T, including the void set. In this case #{A4t} — 2 N ■ 
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Fig 1. On each line, one can find a model by looking at the collection of red intervals between "[" or "] ". For 
the Nested Strategy, here are all the models for J = 3. For the Regular strategy, here are all the models for 
N = 4. For the Irregular and Islands strategies, these are just some examples of models in the family with 
N = 8. 



Islands strategy This last strategy has been especially designed to answer our biological 
problem. We think that h has a very localized support. The interval (0,A] is really 
large and in fact h is non zero on a really smaller interval or a union of really smaller 
intervals: the resulting model is sparse. We can consider V a regular partition such that 
|r| = N and then consider Mt the set of all the subsets of T. A typical m corresponds 
to a vectorial space S m where the functions g are zero on (0, A] except on some disjoints 
intervals which look like several "islands". In this case #{A4t} = % N - 

Figure 1 gives some more visual examples of the different strategies. 

2.4. Self-inhibition. The self-interaction can be modeled in a more general way by a pro- 
cess whose intensity is given by 

(2.14) \{t)=(v+( h(t-u)dN u 

\ ■' — oo 
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where h may now be negative. We have taken the positive part to ensure that the intensity 
remains positive. Then the condition f \h\ < 1 is sufficient to ensure the existence of a 
stationnary version of the process (see [7]). When h(d) is strictly positive there is a self- 
excitation at distance d. When h(d) is strictly negative, then there is a self-inhibition. It is 
more or less the same interpretation as above (see (1.1)) except that now all the previous 
occurrences are voting whether they "like" or "dislike" to have a new occurrence at position 
t. If this process is not studied in this paper from a theoretical point of view because of 
major technical issues (except in the remarks following Theorem 2), note that however our 
projection estimators, s m , and penalized projection estimators, s, do not take the sign of g or 
h into account for being computed. That is the reason why we will use our estimators, even 
in this case, for the numerical results. 



Finally we use in the sequel the notation □ which represents a positive function of the 
parameters that are written in indices. Each time is written in some equation, one should 
understand that there exists a positive function of such that the equation holds. Therefore 
the values of may change from line to line and even change in the same equation. When 
no index appears, □ represents a positive absolute constant. 



3. Main results. For technical reasons we are not able to carefully control the behavior 
of the projection estimators if v tends to or to infinity, but also if p (see (2.1)) tends to 1: in 
such cases, the number of points in the process is either exploding or vanishing. Consequently 
the theoretical results are proved within a subset of L 2 . Let us define for all real numbers 
H > 0, 77 > p > 0, 1 > P > 0, the following subset of L 2 : 

f = (p,g)GlL 2 / 



rV,P 
t ~H,P 



p G [p, ??], g{.) G [0, H] and J g(u)du < p\ . 



If we know that s belongs to CJ^p and if we know the parameters H, 77 and p, then it 
is reasonable to consider the clipped projection estimator, s m . If we denote the projection 



estimator s m 


— {pmi hrr 


t ) then s m = 


{v m , h m ) is given, for all positive t, by 










u m if p < v m < i], 








< 


p if v m < p, 










, 1 if v m > rj, 


(3.1) 


< 
















' Kn(t) if < h m {t) < H, 






h m (t) = 


< 


if h m (t) < 0, 










H if h m {t) > H. 
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Note that s m , the clipped version of s m , is only designed for theoretical purpose. Whereas s m 
may be computed even for possibly negative h, the computation of s m does not make sense in 
this more general framework. For the clipped projection estimator, we can prove the following 
result. 

Proposition 1. Let (NtjteR be a Hawkes' process with intensity given by Let m 

be a model written on T where T is a regular partition of (0, A] such that 



(3.2) |r| < -^-3. 

Then if s belongs to C^p, the clipped projection estimator on the model m satisfies 



E(|s m - s|| 2 ) < n H ,p, v , P ,A 



s 2 + m + l)-§— 



This result is a control of the risk of the clipped projection estimator on one model. A first 
interpretation is to assume that s belongs to S m . In this case, if m is fixed whereas T tends 
to infinity, Proposition 1 shows that s m is consistent as the maximum likelihood estimator is 
and that the rate of convergence is smaller than log(T)/T. It is well known that the MLE is 
asymptotically Gaussian in classical settings with a rate of convergence in 1/T. But the aim 
of Proposition 1 is not to investigate asymptotic properties: the virtue of the previous result 
is its non asymptotic nature. It allows a dependence of m on T, as soon as (3.2) is satisfied 
(see Section 6 for the resulting minimax properties). 

There are two terms in the upper bound. The first one ||s m — s|| 2 has already been identified 
as the bias of the projection estimator. The second term can be viewed as an upper bound 
for the stochastic or variance term. Actually this upper bound is almost sharp. If we assume 
that s belongs to S m , i.e. s = s m , then the bias disappears and the quantity E(||s m — s|| 2 ) - a 
pure variance term - is in fact upper bounded by a constant times \m\ log(T)/T. But on the 
other hand, we have the following result: 

Proposition 2. Let m be a model such that inf/ em £(7) > £q then there exists a positive 
constant c depending on A, 77, P, p, H such that if \m\ > c then 

inf sup E s (||s - s|| 2 ) > n H: p tV:P:A mm ( ^-,£ \m\ J . 
s ses m ncy p V 1 J 

The infimum over s represents the infimum over all the possible estimators constructed on 
the observation on [— A, T] of a point process (Nt)t- E s represents the expectation with respect 
to the stationnary Hawkes' process (Nt)t with intensity given by *& s (-)- 
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Hence, when s belongs to S m , the clipped projection estimator has a risk which is lower 
bounded by a constant times \m\/T and upper bounded by |m| log(T)/T. There is only a loss 
of a factor log(T) between the upper bound and the lower bound. This factor comes from the 
unboundedness of the intensity. The best control we can provide for the intensity is to bound 
it on [0, T] by something of the order log(T). The reader may think to this really similar 
fact: the sup of n i.i.d. variables with exponential moments can only be bounded with high 
probability by something of the order log(ra). Note also that the clipped projection estimator 
is minimax on S m D £pfp up to this logarithmic term. 

Now let us turn to model selection, oracle inequalities and penalty choices. As before if we 
know H, 7} and p, then it is reasonable to consider the clipped penalized projection estimator, 
s for theoretical purpose. Recall that the penalized projection estimator s = (i>, h) is given by 
(2.12). Then the clipped penalized projection estimator, s = (v, h), is given, for all positive 
t, by 



(3.3) 



v 
P 
V 



if 
if 
if 



p < v <r), 
v < p, 
v>r), 

< h(t) < H, 
h(t) < 0, 
hit) > H. 

The next Theorem provides an oracle inequality in expectation (see (2.13)). 





■ h{t) 


if 


h(t) = < 





if 




H 


if 



Theorem 1. Let (Nt)teR be a Hawkes' process with intensity ^ s (-)- Assume that we know 
that s belongs to CJ^fp. Moreover assume that all the models in Mt are written on T, a regular 
partition of (0,A] such that (3.2) holds. Let Q > 1. Then there exists a positive constant k 
depending on rj, p, P, A, H such that if 

log(T) 2 



(3.4) 

then 
Ef 



Vm G Mr, pen(m) = kQ(\itl\ + l)- 



< O v p p a,h inf 
m^Mr 



+ 



m 



+ 1) 



log(T) 2 



+ □ 



n,p,P,A,H- 



#{Mt} 

tq 1 



The form of the penalty is a constant times \m\ log(T) 2 /T, i.e. it is equal to the variance 
term up to some logarithmic factor. Remark also that choosing the penalty as a constant times 
the dimension leads to an oracle inequality in expectation. The multiplicative constant is not 
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an absolute constant but something that depends on all the parameters that were introduced 
(H,rj,P, etc). This is actually classical. Even in the Gaussian nested case (see [6]), Mallows' 
C p multiplicative constant is 2a 2 where a 2 is the variance of the Gaussian noise. The form 
is simpler than in our case but still an unknown parameter a 2 appears. With respect to the 
Gaussian case, remark that there is also some loss due to logarithmic terms. Finally, for readers 
who are familiar with model selection techniques, we do not refined the penalty with the use 
of weights, because the concentration formulas we use to derive the penalty expression are 
not concentrated enough to allow a real improvement by using those weights. The Gaussian 
concentration inequalities do not apply to Hawkes' processes, even if there are some attempts 
at proving similar results [22]. As a consequence, we are not able to treat families of models as 
complex as in [5]. This lack of concentration actually comes from an obvious essential feature 
of the Hawkes process: its dependency structure. This has already been noted in several papers 
on counting processes (see [20] and [21]). Here the dependance is not a nuisance parameter 
but the structure we want to estimate via the function h. Very related works may be found in 
discrete time: autoregressive process in [2] or [3] and Markov chain in [14]. In all these papers, 
multiplicative constants, that are usually unknown by practitioners, appear in the penalty 
term, as in the Gaussian framework, where the variance noise a 2 is usually unknown. In the 
Gaussian case there have been several papers dealing with the precise theoretical calibration of 
those constants in a data-driven way (see [6] or [1]). Here, since the concentration inequalities 
are too rough, we cannot prove theoretical calibration. So we have decided to find at least a 
practical data-driven calibration of this multiplicative constant (see Section 4) . 

4. Practical data-driven calibration via simulations. The main drawback of the 
previous theoretical results is that the multiplicative constant in the penalty is not computable 
in practice. Even if the formula for the factor k is known, it heavily depends on the extra 
knowledge of parameters (H, rj, P, etc) that cannot be guessed in practice. On the contrary, 
A is a meaningful quantity, at least for our biological purpose. The aim of this section is to 
find a performant implementable method of selection, based on the following theoretical fact: 
(3.4) proves that a constant times the dimension of the model should work. 

4.1. Compared methods. Since our simulation design (see Section 4.3) is computationnaly 
demanding, we restricted ourselves to models m with at most 15 intervals. Consequently, we 
did not consider the Nested Strategy because it would only involve five models in the family. We 
then only focus on the three following strategies: Regular, Irregular and Islands. Since we are 
looking for a penalty that is inspired by (3.4), we compare our penalized methods to the most 
naive approach, namely the hold-out procedure described below. As said in the Introduction 
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(Section 1), the log-likelihood contrast coupled with an AIC penalty (see for instance [12]) is 
only adapted to functions g defined on regular partitions, so we do not consider this method 



Moreover, the truncated estimators are designed for minimax theoretical purposes, but of 
course they depend on parameters (H etc.) that cannot be guessed in practice. They also 
force the estimate of h to be nonnegative. Therefore in this section we only use non truncated 
estimators (see (2.10), (2.11), (2.12)). 

Hold-out The naive approach is based on the following fact (which can be made completely 
and theoretically explicit in the self-exciting case). We know (see Lemma 3) that 77 1 is 
a contrast. We know also that E(7t(/)) = ||/ — — Moreover we know that the 
projection estimators s m behave nicely (see Proposition 1). Now we would like to select 
a model rh such that is as good as the best possible s m . So one way to select a good 
model m should be to observe a second independent Hawkes process with the same s 
and to compute the minimizer of 7T,2(s m ) over A4t (where s m is computed with the 
first process and 7t,2 is our contrast but computed with the second process). However 
we do not have in practice two independent Hawkes processes at our disposal. But one 
can cut [—A, T] in two almost independent pieces. Indeed the points of the process in 
[—A, T/2 — A] and in [T/2, T] can be equal to those of independent stationnary Hawkes 
processes and this with high probability (see [22]). Hence in the sequel whenever the 
Hold-out estimator is mentionned, and whatever the family A4t is, it is referring to the 
following procedure. 

1. Cut [— A, T] into two pieces: Hi refers to the points of the process on [—A, T/2 — A], 
H2 refers to the points of the process on [T/2, T]. 

2. Compute s m for all the m in Mt by minimizing the least-square contrast yr,i on 
S m computed with only the points of Hi, ie 



here. 





3 



Compute 7T,2(<s m ) where 7^2 is computed with H2, i.e., 





4 



and find rh = argmin mg ^ T Jr,2(s m ). 
The Hold-out estimator is defined by s 



HO ._ 2 „ 
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Penalized Theorem 1 shows that theoretically speaking a penalty of the type if (|m| + l) 
should work. However the theoretical multiplicative constant is not only not computable, 
it is also too large for practical purpose. So one needs to consider Theorem 1 as a result 
that guides our intuition towards the right shape of penalty and one should not consider 
it as a sacred and not improvable way of penalizing. Therefore we investigate two ways 
of calibrating the multiplicative constants. 

1. The first one follows the conclusions of [6]. In the Regular strategy, there exists at 
most one model per dimension. If there exists a true model mo, then for |m| large 
(larger than |mo|) 7T(s m ) should behave like — fc(|m| + l). So there is a "minimal 
penalty" as defined by Birge and Massart of the form pen min = £;(|m.| + l). In this 
situation their rule is to take pen(m) = 2 * pen min (m). 

We find a A; by doing a least-square regression for large values of \m\ so that 

7r(s m ) ^ -k(\m\ + l). 

Then we take 

m = argmin m6AlT 7 T (s m ) + 2k(\m\ + l), 
and we define s min := §rh- 

Let us remark that the framework of [6] is Gaussian and i.i.d. It is in our opinion 
completely out of reach to extend these theoretical results here. However, at least 
in the Regular strategy, the concentration formula that lies at the heart of our proof 
is really close to the one used in [6] (see (8.10)), which tends to prove that their 
method could work here. 

For the Irregular and Islands strategy, as a preliminary step, we need to find the 
best data-driven model per dimension i.e. 

m D = argmin meA/fTi | m | =D 7 T (s m ). 

Then one can plot as a function of D, 7T(<s»n D )- In [6], they also obtain another 
kind of minimal penalty of the form pen min = k(D + l)(log(|r|/Z?) + 5) when 
the Irregular strategy is used. But for very small values of |T| (as here) we would 
not see the difference between this form of penalty and the linear form. Moreover 
theoretically speaking we are not able to justify, even heuristically, such a form of 
penalty for large values of |T|. Indeed the concentration formula in our case (see 
(8.10) in the proofs Section 8) is quite different for such a complex family. For 
really complex families, the parameter x in (8.10) should depend on the model m. 
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Methods 


Strategy 


Selection 


1 


Regular N — 15 


minimal penalty s mm 


2 


Irregular \T\ = 15 


angle method s an9le 


3 


Irregular \T\ = 15 


minimal penalty g ml " 


4 


Islands \T\ = 15 


angle method s angle 


5 


/sZarads |T| = 15 


minimal penalty s mm 


6 


Regular N — 15 


Hold-Out S HU 


7 


Irregular \T\ = 15 


Hold-Out s HU 


8 


/stands |r| = 15 


Hold-Out s" u 



Table 1 
Table of the different methods. 



So we have decided that we will use the same penalty as before even in the Irregular 
and Islands strategies. That is to say that we find a k by doing a least-square 
regression for large value of D so that 

lT{Srh. D ) Oi -&(£> + 1). 

Then we take 

m = argmin mg _ MT 7 T (s fn ) + 2k(\m\ + l), 

and we define s min := s m even for the Irregular and Islands strategies. 

2. On the other hand, the choice of rh by s mm was not completely satisfactory when 
using the Islands or Irregular strategies (see the comments on the simulations 
hereafter). But on the contrast curve: D — > jT(sm D ), we could see a perfectly clear 
angle at the true dimension. So we have decided to compute — k = 7T Sr | r | Z4 Smi 
and to choose 

rh = arg min 7T(s m ) + k(\m\ + l). 

We define s angle := s A . This seems to be a proper automatic way to obtain this 
angle without having to look at the contrast curve. It is still based on the fact that 
a multiple of the dimension should work. This has only been implemented for the 
Irregular and Islands strategies. 

This angle method may be viewed as the "extension" of the L-curve method in 
inverse problems where one chooses the tuning parameter at the point of highest 
curvature. 

Table 1 summarizes our 8 different estimators. 
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4.2. Simulated design. We have simulated Hawkes processes with parameters (u, h), with v 
in {0.001,0.002,0.003,0.004,0.005}, h having a bounded support in (0, 1000] (i.e. A = 1000) 
and on a sequence of length [— A, T] with T = 100000 or T = 500000. The fact that the 
process is or not stationnary does not seem to influence our procedure with this relatively 
short memory (indeed T > 100^4). 

The functions h have been designed so that we can see the influence of p (2.1) on the 
estimation procedure. So fi = 0.0041 [200,400] is a piecewise constant nonnegative function on 
the regular partition T (\T\ = 15) with integral 0.8 and we have tested h = c* f± with c in 
{0.25,0.5,0.75,1} (i.e. p = 0.2,0.4,0.6 and 0.8 respectively). We have also tested a possibly 
negative function /2 = 0.0031 [200,800/3] ~~ 0.0031 [2000/3,2200/3] that is piecewise constant on 
T. Note that (see Section 2.4) the sign of h should not affect the method (penalized least- 
square criterion) whereas the log-likelihood may have some problems each time ^/Q remains 
negative on a large interval. The parameter of importance here is the integral of the absolute 
value, which is here f I/2I = 0.8 and we have tested h = fi- Finally the method itself should 
not be affected by a smooth function h: we have used /3 a nonnegative continuous function 
(in fact the mixture of two Gaussian densities) with integral equal to 0.8 and we have tested 
once again h = fy. 

Remark that the mean number of observed points belongs to [125, 12500] which corresponds 
to the number of occurrences we could observe in biological data. 



4.3. Implementation. The minimization of 7^ is actually quite easy since we use a least- 
square contrast. From a matrix point of view, one can associate to some / in S m (see (2.9)) 
a vector of-D + l = |m| + l coordinates 

et/i 
\ a i D J 

where Ii, . . . , Id represent the successive intervals of the model m. Let us introduce 



T N [0,T] 



\ 



T'/o T *(o,lx 1 )(*)^ 
\Tlo^(o,i lD )(t)dN t J 
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and 

1 ^J^ (0tlli) (t)dt ... ±jTy i0>llD) (t)dt \ 



X m 



V?'/o r *(o,i^)(*)^ T/o^(o,i, 1 )( i ) $ (o,i lD )W* ••• H T *(o,i, D )W dt y 

It is not difficult to see that the constrat 7t(/) can be written like 

lT{f) = -26> m b m + t 9mX m v 



Therefore, the minimizer m of 7r(/) over / in S m satisfies X m # m = b m , i.e. 9 m = ~K^h m . 
Since the functions ^ r (o,i J )(*) are piecewise constants, despite their randomness, it may be 
long but not that difficult to compute X m . It is also possible to compute Xr and to deduce 
from it the different X m 's, when one uses the Islands or Irregular strategies. Nevertheless, 
both Islands and Irregular Strategies require to calculate each vector 9 m for the 2' r l possible 
models m and to store them to evaluate the oracle risk (see below). We thus restricted our 
Monte Carlo simulations to models m with less than 15 intervals. For the analysis of single 
real data sets, the technical limitation of our programs is |T| = 26 due to the 2l r l possible 
models. The programs have been implemented in R and are available upon request. 

4.4. Results. The quality of the estimation procedures is measured thanks to two criteria: 
the risk of the estimators and the associated oracle ratio. 

• We call Risk of an estimator the Mean Square Error of this estimator over 100 simu- 
lations, i.e. we compute for each simulation ||s — s|| 2 and next we compute the average 
over 100 simulations. Note that with the range of our parameters, the error of estima- 
tion of v will be really negligible with respect to the error of estimation for h, so that 
\\s-s\\^f A (h-hf. 

• The Oracle Risk is for each method the minimal risk, i.e. min me _A/f T Risk(s m ). All our 
methods give an estimator s that is selected among a family of s m 's. The Oracle ratio 
is the ratio of the risk of s divided by the Oracle Risk, i.e. 

Risk(s) 



min mG _A/f T Risk{s r 



If the oracle ratio is 1, then the risk of s is the one of the best estimator in the family. 
Note that the definition of Mt and even the definition of s m appearing in the Oracle 
ratio may change from one method to another one. 
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Fig 2. Risk of the 8 different methods for h = 0.5 * fi for different values of v and T. 

Figure 2 gives the Risk of our estimators for h = 0.5 * f\ for various u and T. We first 
clearly see that the risk decreases when T increases whatever the method. Then we see that 
the "best methods" are Methods 1, 2 and 4, i.e. the Regular strategy with minimal penalty 
and the Irregular and Islands Strategies with the angle method. For the Irregular and Islands 
Strategies, the minimal penalty seems to behave like the Hold-out Strategies. There seems also 
to be a slight improvement when v becomes larger, tending to prove that, if the mean total 
number of points E(iV[0, T]) = vT/{\ — p) grows, the estimation is improved - at least in our 
range of parameters. Figure 3 gives the Oracle Ratio of our estimators in the same context. 
The oracle ratio is really close to 1 for Methods 1, 2 and 4 when T = 500000 whatever v is. 
Remark that the Oracle Ratio for the Hold-Out estimators (Methods 6, 7 and 8) is not that 
large but since the estimators s m are computed with half of the data, their risks are not as 
small as the projection estimators used in the penalty methods. This explains why the Risk 
of the Hold-Out methods is large when the Oracle Ratio is close to 1. The Oracle Ratio is 
improving when T becomes larger for our three favorite methods (namely 1, 2, 4). 

Figure 4 gives the variation of the risk with respect to p (2.1). Since h = c* f\ and since c 
varies, the Rescaled Risk, Risk/c 2 , gives (up to some negligible term corresponding to v) the 
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Fig 3. Oracle ratio of the 8 different methods for h = 0.5 * fi for different values of v and T. 
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Fig 4. Rescaled Risk (Risk /c 2 ) of the 8 different methods for h = c*/i and v = 0.001, for different values of 
c and T . 
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Fig 5. Oracle ratio of our estimators for h = c * /i and v = 0.001 for different values of c and T (top). The 
bottom picture zooms in on the top picture for oracle ratio between 1 and 2. 
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Fig 6. Frequency of the chosen dimension \rh\ + 1 for the different methods when T = 500000, v — 0.001 and 
h — 0.5 * /i. Note that the true dimension is 6 for the Regular method (chosen in 100% of the simulations 
by Method 1) and 4 for the Irregular and Islands methods (chosen in more than 95% of the simulations by 
Methods 2 and 4 ) 
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Fig 7. Contrast ( C) and penalized contrast (PC) as a function of the dimension for the three favorite methods 
on one simulation with T — 500000, v = 0.001 and h = 0.5 * /i. The chosen estimators (PE) are in blue 
whereas the function h — 0.5 * /i is in red. 

risk of h/c as an estimator of f\. We clearly see that when T or c becomes larger the Rescaled 
Risk is decreasing. So it definitely seems that if the mean total number of points grows, the 
estimation is improving. Methods 1, 2 and 4 seem to be still the more precise ones. Figure 
5 gives the oracle ratio in the same situation. Once again there is an improvement when T 
grows at least for our three favorite methods (1, 2 and 4) and the Oracle ratio is 1 when 
T = 500000 and c = 0.8. The same comment about a good Oracle Ratio for the Hold-out 
methods apply. 

Figure 6 gives the frequency of the chosen dimension, namely \m\ + 1 for the different 
methods. Clearly Methods 1, 2 and 4 are correctly choosing the true dimension in most of 
the simulations when the other methods overestimate the true dimension. 

Finally Figure 7 shows the resulting estimators of Methods 1, 2 and 4 on one simulation. 
In particular, before penalizing, note that one clearly sees an angle on the contrast curve 



24 



P. REYNAUD-BOURET AND S. SCHBATH 



at the true dimension and that penalizing by the angle method (Methods 2 and 4) gives an 
automatic way to find the position of this angle. 

Figure 8 shows the results for the possibly negative function /2 and only for our three 
favorite methods (1, 2, 4). For this function only and because the true dimension is 16 for 
Method 1, we use for Method 1 , |T| = 25. Note that (i) Method 1 and 4 select the right 
dimension whereas Method 2 (Irregular Strategy) does not see the negative jump and that (ii) 
it is also more easy to detect the precise position of the fluctuations on the sparse estimate 
given by Method 4 (compared to Method 1). For sake of simplicity we do not give the risk 
values, but it is sufficient to note that, for all the methods, they are small (with a slight 
advantage for Method 4) and that the oracle ratios are close to 1. 

Figure 9 gives the same results for the smooth function f%. Of course since the projection 
estimators are piecewise constant, they cannot look really close to fy. But in any case, Method 
1 and more interestingly Method 4 gives the right position for the spikes whereas Method 2 
does not see the smallest bump. 

Finally let us conclude the simulations by noting that the penalized projection estimators 
with the Islands strategy and the angle penalty (Method 4) seems to be an appropriate method 
for detecting local spikes and bumps in the function h and even negative jumps. 

5. Applications on real data. We have applied the penalized (angle method) estima- 
tion procedure with the Island strategy (Method 4) to two data sets related to occurrences of 
genes or DNA motifs along both strands of the complete genome of the bacterium Escherichia 
coli (T = 9 288442). In both cases, we used A = 10 000 as the longest dependence between 
events and the finest partition corresponds to |T| = 15. 

The first process corresponds to the occurrences of the 4290 genes. Figure 10 (top) gives the 
associated contrast and penalized contrast, together with the chosen estimator of h (fh = 4 
and v = 3.64 10 -4 ). The shape of this estimator tells us that 

• gene occurrences seem to be uncorrelated down to 2600 basepairs, 

• they are avoided at a short distance (~ 0-500 bps) and 

• favored at distances ~ 700-2000 bps apart. 

This general trend has been refined by shortening the support A to 5000 and then to 2000 
(see Figure 11). It then clearly appears both a negative effect at distances less than 250 bps, 
and a positive one around 1000 bps. This is completely coherent with biological observations: 
genes on the same strand do not usually overlap, they are about 1000 bps long in average, 
and there are few intergenic regions along bacterial genomes (compact genomes). 

The second process corresponds to the 1036 occurrences of the DNA motif tataat. Figure 
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Fig 10. Contrasts, penalized contrasts and chosen estimators for both E. colt datasets 



10 (bottom) gives the associated contrast and penalized contrast, together with the chosen 
estimator of h (m = 5 and u = 7.82 10~ 5 ). The shape of the estimator suggests that 

• occurrences seem to be uncorrelated down to 4000 basepairs, 

• favored at distances ~ 0-1500 bps and 3000 bps apart, 

• highly favored at a short distance apart (less than 600 bps). 

After shortenning the support A to 5000 (see Figure 11), the shape of the chosen estimator 
shows that there actually are 3 types of favored distances: very short distances (less than 300 
bps), around 1000 bps and around 3500 bps. This trend is again coherent with the fact that 
(i) the motif tataat is self-overlapping (two successive occurrences can occur at a distance 
5 apart), (ii) this motif is part of the most common promoter of E. coli meaning that it 
should occur in front of the majority of the genes (and these genes seem to be favored at 
distances around 1000 bps apart from the previous example), (hi) some particular successive 
genes (operons) can be regulated by the same promoter (this could explain the third bump). 

Figure 12 presents the results of the FADO procedure [12]. Here we have forced the estima- 
tors to be piecewise constant to make the comparison easier. Note however that the FADO 
procedure may be implemented with splines of any fixed degree. 

Our results are in agreement with the ones obtained by FADO. Compared to the later, our 
new approach has two advantages. First, it gives a better idea of the support A of the function 
h: indeed, the estimator provided by FADO (cf. Fig.l2-top-left) has some fluctuations until 
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Fig 11. Chosen estimators for both E. coli datasets for different values of A: A = 5000 (left, right) and 
A = 2000 (middle). 

the end of the interval whereas our estimator (cf. Fig. 10-top-right) points out that nothing 
significant happens after 3000 bps. Second, our method leads to models of smaller dimension 
(|m| = 4 for Islands versus |m| = 15 for FADO). The limitation of our method is essentially 
that we only consider piecewise constant estimators, but this is enough to get a general trend 
on favored or avoided distances within a point process. 

6. Minimax properties. The theoretical procedures of Proposition 1 and Theorem 1 
have more theoretical properties than just an oracle inequality. This section provides their 
minimax properties. In particular, even if it has not been implemented for technical reasons 
that were described above, the Nested strategy leads to an adaptive minimax estimator. Such 
kind of estimators were not known in the Hawkes model, as far as we know. 

6.1. Holderian functions. First, one can prove the following lower bound. 

Proposition 3. Let L > and 1 > a > 0. Let 

n L , a = {s = (u,h)eh 2 / Vx,y€(0,A], \h(x)-h(y)\<L\x-y\ a ). 

Then 

inf sup E s (\\s - sf) > D H ,p,A, v ,p, a mm (l^T'^,1) . 
s seH L , a nC™ P v ' 

The infimum over s represents the infimum over all the possible estimators constructed on 

the observation on [— A, T] of a point process (N t )t- E s represents the expectation with respect 

to the stationnary Hawkes' process (Nt) with intensity given by ^? s (-)- 

But on the other hand, let us consider the clipped projection estimator s m with m a regular 
partition of (0, A] such that 

H ~ (T/iogCr)) 1 /^ 1 ). 
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Fig 12. FADO estimators for both E. coli datasets for different values of A: A = 10000 (left) and A — 2000 
(right) for genes or A — 5000 (right) for tataat . 



If the function h is in Hh,a^ ^jfp with o, G (1/2, 1], then, applying Proposition 1, s m satisfies 

2o/(2a+l) 



(lo (T'] \ ' 



Compared with the lower bound of the minimax risk (Proposition 3), we only lose a logarith- 
mic factor: the clipped projection estimators are minimax on %L,a H ^jfp, with a G (1/2, 1], 
up to some logarithmic term. We cannot go beyond a = 1/2 because one needs \m\ « y/T 
in Proposition 1. 

Of course, we need to know a to find s m , so s m is not adaptive with respect to a. But 
the clipped penalized projection estimator s with the Nested strategy can be adaptive with 
respect to a. It is sufficient to take J ~ log 2 (VT/ log(T) 3 ) to guarantee (3.2). Then we apply 
Theorem 1 with Q = 1.1 for instance. Since #{Mt} is of the order log(T), we obtain that 



E - s|| < n H „, P ,A,p mf 

m£MT 



\s - s m \\ 2 + (|m| + 1 



log(T) 2 



If h is in 7ii,a H >C^ p with a G (1/2, 1], then there exists m in A4p such that 



|m| ~ (T/\og(Ty) 



2U/(2a+l) 
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and consequently 

o /log(T) 2 \ 2a/(2a+1) 



IE s-s < □ 



H,r],P,p,A,L,a 



/logpry 



Therefore, the clipped penalized projection estimator s with the Nested strategy and the 
theoretical penalty given by (3.4) is adaptive minimax on \jiL,a H £ r jfp, a G (1/2, 1]} up 
to some logarithmic term. 

6.2. Irregular and Islands sets. Let us apply Theorem 1 to the Irregular strategy and 
Islands strategy. In both cases, the limiting factor here is #{A4t}- Take N < log 2 (r), then 
#{M T ] < T and if Q > 2 we obtain that 

log(T) 2 ^ 



E(\\s-s\\y <n H ,p,A, v , P mf 

meMr 



||s-s m || + (|m| + 1)- 



T 

To measure performances of those estimators, one needs to introduce a set of sparse functions 
h, functions that are difficult to estimate with a Nested strategy. A piecewise function h is 
usually thought as sparse if the resulting partition is irregular with few intervals. So we define 
the Irregular set by: 

(6-1) ^F,D - = partition written on F ,\m\= D ^ m ■) 

Then, if s belongs to S'f r D , using the Irregular strategy, the clipped penalized projection 
estimator satisfies 

E(\\s-s\\y <0 H , P , VtPA D- 

But for our biological purpose, the sparsity lies in the support of h. So we define the Islands 
set by 

(6-2) 5i?£> := U mcr | m | =Z) 5. m . 

Then, if s belongs to S^rj, using the Islands strategy, the clipped penalized projection esti- 
mator also satisfies 



l2 _ ^log(T) 2 



E(||s- S ||) 2 < H p„„ a D 



log(T) 2 



On the other hand it is possible to compute lower bounds for the minimax risk over those 
sets. 

Proposition 4. Let T be a partition of (0,A] such that inf/ gr ^(/) > £q- Let \Y\ = N 
and let D be a positive integer such that N > 4D. If D > C2{A,n,P, p,H) > 1, for c 2 some 
positive constant depending on A, n, P, p, H , then 



inf sup E s (||s - s\\ ) > U H ,p,A,ri,p min ,D£ 
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and 

inf sup E s (|s - s\\ 2 ) > n HtPA ^ p mm { - ^| ^ , D£ ) . 

s 'esjftnjqgp V 1 J 

The infimum over s represents the infimum over all the possible estimators constructed on 
the observation on \—A,T\ of a point process (Ntjf represents the expectation with respect 
to the stationnary Hawkes' process (N t ) with intensity given by ^ s (-)- 

To clarify the situation it is better to take N = \T\ ~ log(T). If D ~ log(T) a with a < 1 then 
the lower bound on the minimax risk is of the order log(T) a log log T /T when the risk of the 
clipped penalized projection estimator (for both strategies) is upper bounded by log{T) a+2 /T, 
and this whatever a is. So our estimator matches the rate 1/T up to a logarithmic term. Of 
course the most fundamental part is this logarithmic term. Think however that there exists 
some function h in those sets, such that the function belongs to Sp but to none of the other 
spaces S m for m in the family Mt described by the Nested Strategy. Consequently, a clipped 
penalized estimator with the Nested Strategy would have an upper bound on the risk of the 
order log(T) 3 /T, by applying Theorem 1. So the Irregular and Islands strategies have not 
only good practical properties, but there is also definitely a theoretical improvement in the 
upper bound of the risk. 

7. Technical results. 

7.1. Oracle inequality in probability. The following result is actually the one at the origin 
of Theorem 1. Note that this result holds for the practical estimator, s, which is not clipped. 

Theorem 2. Let {Ntjt^M. be a Hawkes' process with intensity V^sO- Let H , r] and A be 
positive known constants such that s = (v,h) satisfies v G [0, ??] and h(-) G [0, H\. 
Moreover assume that the family Mt satisfies 

inf inf 1(1) > £ > 0. 

Let S be a finite vectorial subspace of L 2 containing all the piecewise constant functions 
constructed on the models of Mt- Let R > r > be positive real numbers, let M be a positive 
integer and let us consider the following event 

B={vte[0,T], N([t-A,t))<M and V/ G S, r 2 \\f\\ 2 < D 2 T (f) < R 2 \\f\ 2 }, 

where N([t — A, i)) represents the number of points of the Hawkes process (Nt)t in the interval 
[t— A,t). We set A = (rj+HJ\f)R 2 jr 2 and we considers andx any arbitrary positive constants. 
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If for all m 6 Mt 



o . \m\ + 1 , / — n9 



pen(m) > (1 + efK— — (1 + 3V2x 



then there exists an event £l x with probability larger than 1 — 3#{Mt}& x such that for all 
m G Mt, both following inequalities hold : 

sr 2 

(7.1) Y+l^ ~ s||2len ^ - (1 + £ ) D t( s ™ ~ s) + (1 + - s±) + r 2 ||s ± - s|| 2 + 

^ 2 .2 / / x n A l+N*lto 2 
+ « - s m + (1 + e)pen(m) + D £ -x + D e 5— = — x , 

where s± denotes the orthogonal projection for f.f o/s on 5, and 

e 



-Uts2 , 2 + £ ^„„ „ „2 , , ^ \ 1 1 — 1 a , m 1+AA 2 /V 



•2 + e + e-^K 2 + -—r 2 \\s - s m \\ 2 + (1 + e)pen(m) + D e A- + D e ^pr^x 2 , 

1 + e / 1 r z l z 

where K is a positive constant depending on s such that I/Id < AT|/| for all f in L 2 (see 
Lemma 2). 

Remark 1 This result is really the most fundamental to understand how the Hawkes' process 
can be easily handled once we only focus on a nice event, namely B. We have "hidden" in B 
the fact that the intensity of the process is unbounded: on B, the number of points per interval 
of length A is controlled, so the intensity is bounded on this event. We have also "hidden" 
in B the fact that we are working with a natural norm, namely Dt, which is random and 
which may eventually behave badly: on B, Dt is equivalent to the deterministic norm ||.|| for 
functions in S. More precisely the result of (7.1) mixes ||.|| and Dt(-) but holds in probability. 
On the contrary, (7.2) is weaker but more readable since it holds in expectation with only 
one norm ||.||. Note also that B is observable, so if one observes that we are on B, (7.2) shows 
that a penalty of the type a factor times the dimension can work really well to select the 
right dimension. Indeed, note that if, in the family Mt, there is a "true" model m (meaning 
that s = s m ) and if the penalty is correctly chosen, then (7.2) proves that \\s — s\\ 2 is of the 
same order as the lower bound on the minimax risk on m, namely \m\/T (see Proposition 
2 for the precise lower bound). In that sense, this is an oracle inequality. The procedure is 
adaptive because it can select the right model without knowing it. But of course this hides 
something of importance. If B is not that frequent, then the result is completely useless from 
a theoretical point of view since one cannot guarantee that the risk of the penalized estimator 
and even the risk of the projection estimators themselves are small. 
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Remark 2 In fact, we will see in the next subsection that the choices of M, R, r, Mt are really 
important to control B. In particular we are not able at the end to manage families of models 
with a very high complexity as in [5] or in most of the other works in model selection (see 
Theorem 1 and Section 6). This is probably due to a lack of independency and boundedness 
in the process itself. 

Remark 3 Note also that the oracle inequality in probability (7.1) of Theorem 2 remains 
true for the more general process defined by (2.14) once we replace B by B n B' where 
B' = {Vi < T,X(t) > 0}. But of course then, B' is not observable. This tends to prove that 
even in case of self-inhibition a penalty of the type a constant times the dimension is working. 

7.2. Control of B. The assumptions of Theorem 1 are in fact a direct consequence of the 
assumptions needed to control B, as shown in the following result. 

Proposition 5. Let s £ CJ\f p and R and r such that 

r2 > 2max (TTp)2 M + (i - Py 1 )) and r2 < min (f . ^tt) • 

Moreover let 

61og(T) 



TV 



P - log P - 1 

Let us finally assume that S, defined by Theorem 2, is included in Sr where T is a regular 
partition of (0, A] such that 

Vt 
T < — 

' (logT)3 

Then, under the assumptions of Theorem 2, there exists Tq > depending on rj, p, P, A, R and 
r, such that for all T > To, 

p(£ c ) < n v , PA ±. 

These technical results imply very easily Proposition 1 and Theorem 1. 

Proof. [Theorem 1] We apply (7.2) of Theorem 2 to s. Since s is closer to s than s, the 
inequality is also true for s. We choose x = Qlog(T) and M, R,r according to Proposition 5. 
On the complementary of B n £l x , we bound \\s — s\\ by rj 2 + H 2 A and the probability of the 
complementary of the event by 

/ 1 #{M T } 

U v ,P,A,p,H [7^2^ 

The same control may be applied if T is not large enough. To complete the proof, note finally 
that K < d Vi p )A . □ 
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Proof. [Proposition 1] We can apply Theorem 2 to a family that is reduced to only one 
model m. If the inequality is true for the non truncated estimator and if we know the bounds 
on s then the inequality is necessarily true for the truncated estimator, which is closer to s 
than s. Then the penalty is not needed to compute the estimator but it appears nevertheless 
in both oracle inequalities. We can conclude by similar arguments as Theorem 1, but if we take 
x = log(T) in (7.2), we lose a logarithmic factor with respect to Proposition 1. We actually 
obtain Proposition 1 by integrating also in x the oracle inequality in probability (7.1) and we 
conclude by similar arguments, using that \.\d < -^l-l- D 

8. Proofs of the technical and minimax results. 



8.1. Contrast and norm. First let us begin with a result that makes clear the link between 
the classical properties of the Hawkes process (namely the Bartlett spectrum) and the quantity 
J g 2 that is appearing in the definition of the L 2 space (2.3). 

Lemma 1. Let (Nt)teR be a Hawkes process with intensity *& s (-)- Let g be a function on 
K+ such that g{u)du is finite. Then for all t, 



E 



t 

g(t - u)dN u 

-oo 



{1-pf 



g{u)du) + / \Fg(-w)\ f N (w)dw 



< 



g{u)du I + 



where 



{1-pf 



In(w) = ; -o, 

2vr(l - p) 1 - Fh{w)\ 



{1-pf 



+oo 



g 2 {u)du, 



is the spectral density of (Nt)t£R. 

Remark (Notation): Th is the Fourier transform of h, i.e. Fh{x) = j^e txt h(t)dt. 
Proof. Let 4> t {u) = l u <tg(t - u). We know (see [8, p 123]) that 



Var 



(j>t{u)dN u 



\F4> t {w)\ f N {w)dw. 



Moreover, since g has a positive support, F(f>t(w) = e J-g(-w). Hence 



Var 



4>t{u)dN u 



\Fg(-w)\ fN(w)dw. 



But we also know that (see [13]) 
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Consequently 

E 



t 

g(t - u)dN u 

-00 



Var 



Var 



<j) t (u)dN u 



</h(u)dN u 



+ E 



+ A 



Mu)dN u 



g{u)du 



o 



which gives the first part of the Lemma. The second part is due to Plancherel's identity, which 
states 

rA 



(8.1) 



\Fg{— w)\ dw = 2ir / g (x)dx, 



o 



and the fact that /jv is upper bounded by v/[2tt{1 — p) 3 ] since h is nonnegative. 



□ 



Lemma 1 is at the root of Lemma 2, which gives the equivalence between the L 2 -norms, 
||.|| and ||.||d) equivalence that is essential for our analysis. Lemma 1 essentially represents the 
main feature of the lengthy but necessary computations of Lemma 2. 

Lemma 2. The functional is a quadratic form on L 2 and its expectation f.! 2 -, (see 
{2.8)) is the square of a norm on L 2 satisfying 



(8.2) 

where 



V/GL 2 , L\\f\\<\\f\\ D <K\\ 



K 2 = 2 max 



uA + 



1-p 



and 1? = min 



v 1 — p 
4' 8Az/ + 1 



Proof. D 2 -, is a quadratic form since y J* Q T ^ / j(t)^'fc(i)(iA r f, its associated form, is bilinear 
and symmetric in / and k. 
Moreover one can compute \\f\\j)- if / = (m, ff), 

21 



h 1 



[i+ / g{t-u)dN u 



dt 



W E /i 2 + 2fi [ g(t - u)dN u +( [ git- u)dN u 

1 JO J-oo \J-oo 



dt 



L[ T L 2 + 2ti\[ g(t-u)du + E ( [ g 

1 Jo V J-oo \J-oo 

H 2 + 2/iA / g{u)du + - / I 
JO -t Jo 



(t - u)dN u 



dt 



g(t - u)dN u 



dt. 
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Let us use the first part of Lemma 1. This gives that 
(8.3) 

r+oo / f+oo \ 2 

\\f\\ 2 D = E(D 2 (/)) =V 2 + 2/xA / g(u)du + X 2 / g{u)du + / \Fg{-w)\ 2 f N {w)dw. 

JO \J0 / JR 

This is a quadratic norm on L 2 : indeed, l/lf} > and its associated form 

(8-4) V/, k G L 2 , + kf D - \\f\\l - \\k\\ 2 D ) = 1e [ jf U f (t)9 h (t)dN t 

is bilinear and symmetric. It remains to prove that \\f\\ 2 j = implies / = 0, which is automatic 
if we prove (8.2). Refining the computations of Lemma 1, one can easily check that for all w 

(8.5) 0<c= -<f N ( w )<C= o. 

2vr(l-p)(l+p) 2 " 2vr(l-p) 3 

By (8.3) and (8.5), one has 

A \ 2 



fi + X g(x)dx) +c / |J r S'(— u>)| 2 dw < \\f\\ 2 £, < lfi + X g(x)dx ) +C / | J r <7(— w)| 2 dr(;. 

Jo / Jr \ Jo / Jr 

By Plancherel's identity (8.1), 

-A \ 2 r+oo / i-A \ 2 



/ M \ z r+oo / /-A \ z /•- 

U + Ay g(x)dx\ +2vrc y 5 2 (cc)dx < ||/|||, < ^ + X J 9{x)dx\ +2vrC y 



oo 

g 2 (x)dx. 



For the upper bound, remark that 



(^ + X J g(x)dxj <2 / u 2 + 2A 2 Q| g(x)dx 



which implies by Cauchy-Schwarz' inequality that 

»A \ 2 



(fi + xj g(x)dx\ <2j? + 2X 2 A J g 2 (x)dx. 



So K 2 = max(2, 2X 2 A + 2ttC) works. 
• For the lower bound, one has for all 9 > 



^ + A^ g{x)dxj >(l-0)fi 2 +(^l-^jX 2 ^j\(x)dx S 



Then if < 1, this implies, by Cauchy-Schwarz' inequality, that 

fi + A g(x)dx^j > (1 - 0V 2 + (l - ^) ^ ST 



(x)dx. 
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Hence we obtain for all < 9 < 1 



2vrc + 1 - - ) AX 



\\l>(i-e)n 2 



Taking 6 such that 2ttc + (1 — h)AX 2 = ttc, this implies that 



g 2 {x)dx. 



> 



TTC 



But ire > z//4 and 



— + vrc [ g 2 {x)dx. 

AX Z + TTC J 

1 — p 



TTC 



> 



AX 2 + ttc 8Au + 1 ' 
Hence L 2 = min (f/4, (1 — p){SAu + l) -1 ) works. 

Lemma 2 has a direct corollary: -fx defines a contrast. 



□ 



Lemma 3. Let (Nt)t£R be a Hawkes process with intensity *& s (.). Then the functional 
given by 

V/GL 2 , 7r (f) = -lj 9 f (t)dN t + ^J V f (t) 2 dt, 

is a contrast, i.e. E(7y(/)J is minimal for f = s. 

Proof. Let us compute E(7t(/)). As A(t) = ^ / s (i), one can write by the martingale 
properties of dN t - * s {t)dt and by (8.4) that 



E( 7T (/)) = E 
= E 



T 



*f(t)dN t 



+ E{D 2 (f)) 



+ 



\\l 



II 2 || N 2 

S \\D ~ \\ s \\d- 



Consequently E(7t(/)) is minimal when / = s since Lemma 2 proves that ||.||£> is a norm. □ 

8.2. Proof of Theorem 2. This proof is quite classical in model selection. It heavily depends 
on a concentration inequality for x 2 -type statistics that has been derived in [20] and which 
holds for any counting process. The main feature is to use the martingale properties of Nt — 
J* * X(u)du (see (1.1)). We do not need any further properties of the Hawkes process to obtain 
(7.1) (see Remark 3). 

We emphasize that 
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1. the oracle inequalities of Theorem 2 hold for s the practical estimator and not only the 
clipped one, and 

2. that (7.1) holds for possible negative function h up to a minor correction (see Remark 
4 at the end of the proof). 

Proof. Let m be a fixed partition of Mt- By construction, we obtain 

(8.6) 7t(s) + pen(m) < 7r(s m ) + pen(?n) < ^ T (s m ) + pen(m). 

Let us denote for all / in L 2 , 



1 



T 



Mf) = f J Q Vf{t)(dN t -*,(t)dt), 

which is linear in /. Then (2.6) becomes 7t(/) = Dj.(f — s) — Dj,{s) — 2vx(f) and (8.6) leads 
to 

(8.7) Dj<{s — s) < D^(s m — s) + 2vt{s — s m ) + pen(?n) — pen(m). 

By linearity of vt, vt{s — s m ) = vt{s — Sm) + v T{sm — s m ). Now let us control each term in 
the right hand side of (8.7). 

1. Let us begin with A\ = 2ut(s — Sm)- For all m' in Mt-, we set 

(8.8) W m , = sup 

/GS m , II J II 

Thus A\ < 2 1| s — Sml^Fm- Therefore, for all 8 > 0, one has the following upper bound 

(8.9) A 1 <e\\s-s lh f + -Wl. 

Now we need to control Wm which is doubly random: for fixed m, W m is random but 
the choice rh is random too. So one needs to control each W m /'s to control W m . 
To do so, we first need to find a simpler form for W m i. Note that 

lll - 0)}u l{°'7m) Jem '} 

is an orthonormal basis of S m i for ||.||. For all I £ m', let us denote 

Ni(t) = * (0>ll) (t). 
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Then 



W„ 



sup 

M 2 +E/6m' a / = 1 

sup 



(diVt - * s (t)dt) 



F ±-(dN t -y S (t)dt)) + j2 { r 



(dN t - * a (t)dt) 



r ±(dN t -*.(t)dt)) +e if 



N z (t) 



(dN t - *s(t)dt) . 



^ f (u) 2 du > R 



2 II t\\2 



Let T be defined by 

T=|t>0^ N([t - A,t)) > TV or 3feS, IjT* 

and let r be the stopping time defined by 

r = inf{t > 0,4 G T}. 

It is quite easy to see that if £ belongs to T then there exists £' < £ such that t' belongs 
to T. Hence r does not belong to T and since Jjj ^ f(u) 2 du is increasing in £, saying 
that we restrict ourselves to B implies that r > T. Finally we can write that on B, 
W m i = Z m i defined by 



Written in this way, this is a x 2 ~type statistics as defined in [20], since the iVj(.)'s are 
predictable processes and so is tt< T - So Corollary 2 of [20] gives that with probability 
larger than 1 — 2e~ x , 

Z m / < \/Crn' + 3\/2vx + 6x 

where 



T 2 ^ T 2 £(I) 



l t < T $> s (t)dt, v = \\C„ 



and where b is a deterministic constant that should satisfy 

N 2 (t) 



b 2 > l t<7 



1 \ - 



T 2 ^ T 2 £(I) 
iem' y ' 
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But on {r > T} one has for all t < T 

<r] + HM. 

So 



C m ,<(rj + HN)UD 2 T {(1,0)) + Y^ 
\ Jem' 



^t((0, 1/))' 



M J 

Moreover on {r > T}, for all / in S, D^(f) < i? 2 ||/|| 2 , this implies that 

C m , <( V + HM)R 21 -^]^. 

Using £q, one has that b 2 = (1 + J\f 2 /£q)/T 2 works. Finally, on B, with probability 
larger than 1 - 2#{_Mr}e _a: , 



•10) W„ < J ft + HM)R 2 ^ (l + 3v^) + V±+^Jl 



^0 

— X. 

T 



Let us fix some positive numbers 6 and e that will be chosen later and let us go back 

to A\. We obtain the following upper bound 

(8.11) 

A X < e\\s-s A f+l [(1 + e)( v + HAO^J^Ltl (i + 3v ^) 2 + (1 + ^ V+^V 

inequality which holds on i3 with probability larger than 1 — 2#{A4x}e~ x . 
2. Let us control now A2 = 2vT(s r - n — s m ). To do so, we need to control all the V m i = 
VT(s m ' — s m ). But on £>, V m > = U m > where 



U m , = ^J t t < T y Sm ,- Sm (t)(dN t - * a (t)dt). 



So one can use Corollary 1 of [20]: with probability larger than 1 — e x , 

1 b 

U m > < v2vx + — x, 

where v and b are constants such that for all t <T, 

V -^J h<r^s m ,- Sm (t) 2 ^ s (t)dt and b>l t < T ^ {Sm ,_ Sm) (t)\ 
As we stop all the processes in r, we obtain that the following choices work: 



v 



(77 + HM)R 2 ., l2 , , 2iW 
= \\s m > - s m \\ and 6 = — , 



T 
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since the infinite norm of s m is also bounded by H . Consequently, on B with probability 
larger than 1 - #{Mr}e~ x 



(8.12) 



Oq + HM)R 2 2HM 
-x H ^^x. 



T 3T 

But || s m — s m || < \\srh — s\\ + ||s — s m ||. Thus, with the same constant 9 as in (8.11), 



Or] + HJ\f)R 2 „ / Oq + HJ\f)R 2 2HM 
A 2 < 2\\ Srh - s\\\l 2^ ^_ x + 2| s - Sm |U/2^ —^x + -^x 



3T 



< 0Srn-S 2 +9s m -S 2 + -^ — X + —^X. 



3T 



We finally obtain: 
(8.13) 



Now let us go back to (8.7). Using (8.11) and (8.13), we have actually obtained that on B and 
on an event Sl x whose probability is larger than 1 — 3#{M.T}e~ x , the following inequality is 
true 

Dl(s - a) < D 2 T {s m - s) + 9[\\s - Srhf + ||s m - s|| 2 ] +9\\s- s m \\ 2 + 



1 

+ 7T 



(l + e)(r ] + HM)R 21 -^— 1 + 3^ + (1 + e 



T 



J"2 



4 2 \ (r? + HM)R 2 . , . . , 

+ I - + tt^tt I ^ £ + pen(m) — pen(m). 



6» 3R 2 J 

As denotes the orthogonal projection for ||.|| of s on S, we can remark that 



II ~ _ - II 2 4- II . II 2 — ll~ _ II 2 



i ~ ii 2 ii ii 2 

| s — || + \\s± — s\\ . 



Moreover 



D 2 T {s - 8± ) = ^ f (* s -.(t) + 1- s - sx (t)) 2 ^ < (1 + e)Z?r(s - a) + (1 + ^^(a - a ± ). 
J JO 



Hence we obtain that on B D fi 7 



£>t(s - s ±) < (1 + z)DUsm - a) + (1 + £ -1 )-D|(s - a ± ) + (1 + e)0[|a - a ± f + ||s ± - af] + 



+ (! + £) 



+ (1 + e)0||a - a m || 2 + (1 + e)pen(m) 
i(l + e)(?7 + ff.AOi? 2 l m i r +1 (l + 3V2^) 2 -pen(m) 



+ 



+ (1 + e) (* + X) H + y^x + (l±£)0±£^) i±^V4 l2 



3i? 2 



I 1 



J>2 
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But on £>, Dj,(s—s±) > r 2 \\s— s±\\ 2 since s—s± belongs to S. Hence if we choose 9 = r 2 {l+e) 2 , 
we obtain 

2 

Y^\3 - s ± f l Bn n x < (1 + e)D 2 T {s m - s) + (1 + e" 1 )!^ - s ± ) + (1 + e)0|s ± - s| 2 + 

+ (1 + e)9\\s - Smf + (1 + e)pen(m) 

4 J2_\ (77 + HN)R 2 (l + e)(l + £- 1 ) l+M 2 /£ 2 
+ 3R 2 J T X+ 9 T 2 x ■ 

It remains to add er 2 (l + e) _1 ||s^ — splgn^ on both sides, to obtain (7.1). For (7.2), let 
us take the expectation on both parts. We can remark that E(D^(s m — s)) = \\s m — sf^ < 
K 2 \s m — s\ 2 , by applying Lemma 2 and similar computations hold for s±. Moreover remark 
that \\s — s±\\ < I % s||, since S m is a subset of S. This concludes the proof. D 

Remark 4 In case of self-inhibition (see (2.14) and Remark 3), it is sufficient to replace T 
by T n T' where 

r = {t i \(t) = o} 

and to define accordingly the stopping time r to obtain (7.1). 
8.3. Proof of Proposition 5. The control of B is twofold. 

On one hand, one needs to control the number of points in any interval of length A. The 
control of the number of points in one interval comes from some tedious computations that 
have been done in [22]. Then the control for any interval comes from a reasoning that is close 
in essence to the control of the suprema of identically distributed variables with exponential 
moment. 

On the other hand, one needs to control the deviations of D^(f) from its mean for / in 
a finite vectorial subspace. We decompose the problem in controlling the deviations of the 
associated bilinear form for elements of the basis. Those deviations are controlled by using a 
concentration inequality for Hawkes processes that has been derived via coupling in [22]. 

Proof. Through this proof, the value of To will change from line to line but the dependency 
in the parameter is clearly written. We keep the notations of the previous proof. First let us 
introduce 

(8.14) Bb = {Vt€[0,T], N([t-A,t))<M}, 

and 



+ (! + £) 



(8.15) 



B l = {\ffeS, r 2 |/| 2 <£>|,(/) <i2 2 ||/|| 2 }. 
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Then F(B C ) = P(Bg) + F(Bf n B ). 



First let us control Let 



B' 



Vfc € {0, . . . , 



+ i}, jv([(*-i)a,m))<^L 



where [a^J denotes the largest integer smaller than x. Then £>q C Bq and P(jBq) — ^(^o c )- 
But by stationnarity, 



+ 2 



N{[-A,0)) > 



Proposition 2.1 of [22] tells us how to control the deviation of the number of points per 
interval of fixed length (here A). Hence there exists a positive increasing function in p 
(see (2.1)), namely m p (z), such that 



.16) 



N([-A,0)) > 



for all z < p — logp — 1. In particular one can take z = P — log P — 1 and replace m p {z) 
by mp(z) in (8.16). With A/" as in the proposition, one gets that 

^T),A,P 



F(B C ) < 



J*2 



• Now let us control P(£>f n Bo). Let us introduce 



(8.17) 
and 

(8.18) B'l 



v/ g r, 



iy [JV>(t) -E(JV>(t))]dt 



v(/,/')er 5 



[iV J (t)iV//(*)-E(^j(<)iV / /(t))]c« 



< 



where Nj(t) = v I / (o,i / )(*) an d where the xj's and xjj/'s are positive numbers that will 
be chosen later. The control of these events is based on the following lemma which is a 
direct consequence of Case 3 of Proposition 3.3 of [22]. 

Lemma 4. Let g be a function of the points of (Nt)t^R lying in [—A, 0) with values in 
[—B,B]. Let (0t)t£R be the flow induced by {Nt)teR i- e - 9 @t is the same function as 
before, but now the points are lying in [—A + t,t). Then there exists a positive constant 
T (P, A) such that for all T > T (P, A) 



T 



— / go6 t dt 



> |%| +24 



'ciVar^^log^) 2 c 2 BA\og{Tf 



T(P - log P-l) T(P - log P-l) 
where c\ and C2 are absolute constants. 



< 



□ 



rj,P 



J>3 ' 
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First, let us control P(B'f n Bo) by applying the previous lemma to 

g = min(N I ,M) - E(iVj). 
As E(iVj) = z/^(I)(l -p)~\ there exists T (r),p, A) such that for all T > T , 

E(JV» < A". 

Thus B = N works in Lemma 4 as soon as T is large enough. Moreover 

Var(#) < ME(Nj). 

Finally, by stationnarity, 

\E(g)\ <E(N I l Nl > Af ) = E(N I (0)t Nl{o) > Af ). 
IfB' ' denotes {N([-A,0)) < A"}, then 

|E(<?)| <E(iV 7 (0)l B » c ). 
Consequently, we set for all 5 > 

^llog(T) 2 



x 7 = E(JVj(0)lBgb) + <5E(Aj) + 
By Lemma 4, one has then that 



— — + c 2 AA 



T(P-logP-l) 



□„,p 1 



T 3 ~ T 2 VT' 
Now, let us control P(23" c n Bq) by applying Lemma 4 to 

<7 = min (JVjJVji, AA 2 ) -E(N I N r ). 

First remark that 



E(N!N r ) < E(iV([-^,0)) 2 ). 



Let us apply Lemma 1, then 



E(AWp) <E(A([-A,0)) z ) < - 



2 s z^ 2 A 2 zM 77 2 ,4 2 77^4 



+ 77 77 < ' + 



(1-p) 2 (1 -p)3 - (1 -P)2 (1-P) 3 

So there exists T (r?,P,A) such that for all T > T , E(NiNp) < A/" 2 . Thus B 
works in Lemma 4 as soon as T is large enough. Moreover 

Var(#) < M 2 E(N I N P ). 
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Finally 

|E( 5 )| < E(iVj(0)iV r (0)1^(0)^(0)^) 

< E(N I (0)N r (0)t {NimM or Nl , {0) >M}) < 1(^/(0)^,(0)1^). 

Consequently 

x I)V = E(iV T (0)iVj/(0)l s »c) + SE(NjNj>) + 
gives for T > Tq(t], P, A) 

p(ernB )<|r| 2 ^f 



Cl M 2 



+ c 2 M 7 



A\og{Tf 
T(P-logP-l)' 



P((fii / nfiirne )<n,,P^. 



Finally we proved that 

Now it remains to prove that for T large enough, B'{ n fi^ C Si. Let / G S = Sr, f ^ 0. 
Then we can write / = (/z, g) where 



Doing the same kind of development as in the proof of Lemma 2, one has 
^ T ^(o, 9 )W-E^(o, 9 )(t))]^ 



D 2 T (f)-\\f\\ 2 D = 2/x 
So one has 

|^(/) -WfWli 



+ 



| Q /| [opj 



^ /" (JV>(i)JV//(t)-E(JV/JV>/ 



))<tt 



On 0" n this gives 

1 — fev^CO /Krv^COv 7 ^ 7 ) 



Let us introduce 



= y- JM-ij. 
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Then for T > T (r),P,A) 

\D 2 T (f)-\\f\\ 2 D\<M 1 + M 2 + M 3 , 



where 



Mi = 2| M |<5E(*(o, ff+ )(0)) + 5E(* (0i3+) (0) 2 ), 
M 2 = 2| M |E(* (0lfl+ )(0)l fl& *) + E(* (0i9+) (0) 2 1^ C ), 



and 

M 3 . 



/?(/) 



5 



+ c 2 AA 



+ E- 



a j a H 



5 



+ c 2 A/" : 



Alog(T) s 



First let us remark that 2|/i|E(*( 0ifl+ )(0)) + E(* (0i5+) (0) 2 ) 
less than Id//!,^)! 2 -,. But applying Lemma 2, we obtain 



T(P - log P- 1)' 
5+) Hi) ~~ A* 2 5 which is 



S+)l2> < K 2 m,g + )f < 2max (l, ^^fM + (1 - P)" 1 )) 



Let us take 5 = (log T) 1 . This gives 



logT 



Next remark also that 



l a /| 



* (0>g+) (0)<iV([-A0))su p . 

Jer 



But integrating (8.16), one obtains that 

, 2 



E(iV([-A0))"l Bfifc ) < D^p.a-^- and E(N([-A,0))1 b »c) < O v ,p,Aj^- 
Keeping in mind that 1(1) = A/\T\ it remains 



Finally 



M 2 < D VtPtA 

m 3 < UpAir^-W 2 



T 5/2- 

(logT) 2 , 



Hence we obtain that for T > To (A, rj, P) 



□ 



ri,P,A 



logT 



\Dl(f) 



< □ 



m P A - 
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Using Lemma 2, this gives 

logT " l/l 2 " + logT- 
The assumptions on i? 2 and r 2 imply that for T > Tq(A, rj, P, p, R 2 , r 2 ) 



logT 



logT 



This gives that n B'{ C £>i and this concludes the proof. 



□ 



8.4. Proof of the minimax results (Propositions 2, 3 and 4)- We first need two important 
lemmas. 

Lemma 5. Let f = (p,g) and s = (v,h) be two elements of L 2 such that p^v > 0, 
g,h > 0, J g < 1 and J h < 1. Let Fj ' , respectively ¥ s ' , 6e £/ie distribution of a 
stationnary Hawkes process with intensity 5 r /(-), respectively x f s (-), restricted to [—A,T]. Then 
the Kullback-Leibler distance satisfies 

cT 



K(py 



■ A - T ] p[-A,T] 



>=*(/ 



log 



*/(i)cft) +K(P^' 0] ,P[- A '°]), 



where (j)(u) = e u — u — 1 and Ej represents the expectation with respect to P^ 
Moreover if f and s belong to C^jfp and if Ajhloo < P — logT — \, then 

K(p^ 3 1,pM.n) < Tdl/ - s|| 2 + C 2 , 

where C\ and C2 are positive constants depending only on A, H, P, 77, /). 

Lemma 5 shows that the Kullback-Leibler distance between two different processes linearly 
increases with T. It also clarifies the link between the natural Kullback-Leibler distance and 
the L 2 -norm, ||.||, we used. 

Proof. Let us denote by P^ |[-A,o] t ne conditional distribution of the points of the 
process lying in [0,T] conditionnally to the family of points lying in [—^4,0]. Then the clas- 
sical decomposition of the Kullback-Leibler distance with respect to the marginals gives the 
following decomposition. 

,[o,n 



/ 



-AT] p[-A,T] 



)=E 



In 



r/P 
r/I 



»[o.n 



-A,0] 
-A,0] 



+ K(P 



-Aft] 



)P [-A, 0]) _ 
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Next we combine Example 7.2(b) with Proposition 7. 2. Ill of [9] to obtain that the conditional 
likelihood ratio is 



d¥ 



[0,21 
/ 



-A,0] 



,[o,n 



-A,0] 



[ ]n[y f (t)/V s (t)]dN t - [ V f (t)dt+ f * s (t)dt 
Jo Jo Jo 



Using the martingale properties and the fact that the intensity is predictable, one gets the 
first equation of Lemma 5. Now to upper bound the Kullback-Leibler distance, we need first 
to remark that Vx > — 1, log(l + x) > x/(l + x) which gives that 



E 



/ 



log 



(t)dt) 



< E 



/ 



r 2£W 4 )<I, 



It is important to note that here (and only here) ||.||£> is computed with respect to / and 
not s. Now it remains to use Lemma 2 and to upperbound the constants depending on / by 
constants depending on A, H, P, rj, p to obtain the first part of the inequality. 

Then it remains to upper bound K(P^ A0] , Pi" Afi] ). It is easy to see that in fact if we 
denote by N the process on [—2^4, 0], then N = NxL)N 2 where N 2 is the process on [—2^4, — A) 
and Ni is the process on [—A, 0]. Then if we denote by dV the law of a homogeneous Poisson 
process with intensity 1 on [—^4,0], one see that 



dF [ f A > 0] (N 1 )=E f>N2 



exp 



hi[* f (t)]dNi(t)- / [*,(*) -l]d(i) 



dV{Nx 



where Ejjv 2 means that we integrate with respect to the N 2 component with the stationnary 
law of a Hawkes process with intensity Vt/Q. Consequently, 



K(P^ A0] ,p[-^)=E /jAri (in 
So, since ^/() is positive, 



E f>N2 exp ( f_Vn[%(i)]diVi(t) - f° A * f (t)d(t) 



with 



and 



Ax =E f:Nl lnE /iiV2 



An 



E 



f,Ni 



InE 



E S:JV2 ex P (/° A ln[* s (t)]diVi(t) - f_ A V s (t)d(t) 
l (pKo] p [-AO]) <Ax+A 2 

exp(J°^n[y f (t)]dNx(t)) ) 

exp (^J W s (t)dt - J s (t)dNx(t) 



For Ax, we can upper bound it on Cjj 



Ax < E ftNl (lnE SjA r 2 [(p + H(Nx + N 2 )) Nl ]) 
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But for all integer I one can upper bound E /iA r 2 [JV|] by l\E(exp(zN 2 ))/z l and we know that 
this is a bounded quantity by Proposition 2.1 of [22] for z = P — \ogP — 1. Hence 



E 



f,N 2 



Ni 



El" 1 



,JVi— Z rW 



z=o 



iVi 



< D W E I! ( T 1 ) (P + HN^iH/z) 



1=0 



< a^H^pN^p + H/z + HN^K 

Hence Ai < IZU^^^pE^iV 2 ) and Ey(iV 2 ) is bounded by similar arguments. It remains to 
bound A 2 . But 



E 



s,N 2 



exp 



pO 

V s (t)dt- / kiV a (t)dNi(t) 

-A J-A 



< E s N2 [ e ^(p+l^loo(A r i+A r 2))-ln(p)7V 1 j 

< E S<N2 [ e A \ h \°° N *] e Ap+AHNi —ln(p)Ni 



So if Ajhloo < P-logP- 1, Proposition 2.1 of [22] gives an upperbound for E SjN2 [ e A \\ h W^] , 
namely E Sj n 2 [e 2Ar2 ] with z = P — logP — 1 which is bounded by some constant depending 
on A, H, p, r], P. As previously 



if A 

A2 < OA,H,r),p,pEf(Ni) < □A,H,r?,p,P 1 _ p ■ 



This concludes the proof. 



□ 



Lemma 5 combined with Birge's Lemma [4] gives the following result, which is ready to 
use for the different lower bounds in the different situations. 



Lemma 6. Let S be a family of possible s such that ^ s (.) is the intensity of a stationnary 
Hawkes process, and such that s belongs to . Let 5 > and let C C S be a finite family 
such that for all f = (fi,g) £ C, Ajgloo < P — logP — 1. Then there exists Ci end C2 two 
particular positive functions of rj, p, A, P, H such that if for all f ^ f in C 



Cilog|C| - C2 



>II/-/T><5 



then 



infsupE s (||s - s|| 2 ) > 



5(1 -a) 



where a is an absolute positive constant (see [4] for a precise value) 
Proof. First it is very classical to obtain that 



inf supE s (||s — s 
« ses 



| 2 ) > - inf supE s (||s-s|| 2 ) 
4 sec se c 
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But 

K s {\\s-s\\ 2 )>5F s (s^s). 

So 

infsupE s (||s - sf) > - inf ( 1 - inf F s (s = s] 

s sei 5 4 sGC \ s6C 

It remains to apply Birge's Lemma [4], by upper bounding the mean Kullback-Leibler distance 
on C. Using Lemma 5, it remains only to choose d and C2 according to C\ and 62- This 
concludes the proof. □ 

Proof. It is now sufficient to apply the previous Lemma for good choices of C. 

Proposition 2 Let m be a model. We set D = \m\. Let Vq be the maximal collection of 
subsets of m, such that for all X 7^ X' in Vq, \IAI'\ > 9\m\, then by [10], one has that 
log I Pol > cr\m\, for 6 and a some absolute constants. 
Let 

where e is a positive real number that will be chosen later. To ensure that Cq C CP^fp 
we need that e < min(iT, P/A)yflQ. Moreover to apply Lemma 6 we need that e < 



Now, for all fx, fx' in C 



Moreover 



/ X /|| 2 = \lAl'\e 2 > 6De 2 . 
\\fx- fxf <e 2 D. 



Finally taking 

e 2 = min ( , l mm(H, P/A, (P - log P - 1)/A) 2 ^j , 

and applying Lemma 6 gives the result. 
Proposition 4 Let T be a partition of (0, A] and let us concentrate first on the Islands set. 
Let V\ be the maximal collection of subsets of T with cardinal D, such that for all 1^1' 
in V\, \ZAI'\ > 6D, then by the appendix of [19], one has that log \V\\ > aDlog ^, for 
6 and a some absolute constants. Let 
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Then the same computations as before give the result for the Islands set. But note that 
the set C\ is also included in $W2.D+1)" Consequently the lower bound is also valid up 
to some multiplicative constant for SjFqd+i) ■ 
Proposition 3 For the holderian family, let ip be a positive continuous function on K, null 
outside (0, A] and such that for all x,y £ HL, |y(a;) — y(y)| < \x — y\ a . Remark that a 
quantity that only depends on ip actually depends on A and a. 

Let m be a regular partition of (0, A] in Z) pieces. Let pd{x) = LD~ a p(Dx). Let Vq be 
defined as before and 



C 2 = ls x = (p, ^ </?dO - ui)),T e? >, 
I /ex J 



where u/ is the left extremity of I. To ensure that C2 C and that ||g||oo < (P — 
log P — 1) /yl, we need that D > c(A, a, H, P)L 1 / a , for some positive continuous function 

a 

But for all sx, sx' i n C2, 

\\s x -s xl f = \XAl'\L 2 D- 2a - 1 j p 2 >eL 2 D~ 2a j p 2 . 



Moreover 



11 „ „ 11 2 s r2 n-2a / ,2 

I st — s%i || < L L) l tp . 



But note that for D large enough C,\aD — £2 > C'-^ f° r some other constant C'- 
It remains to choose 

to obtain the result. 

□ 

9. Conclusion. We proposed a method based on model selection principle for Hawkes' 
processes that is proved to be adaptive minimax with respect to certain classes of functions. 
In practice, the multiplicative constant in the penalty is calibrated in a data-driven way that 
is proved to work well on simulations. In particular we designed a new method - namely 
the Islands strategy coupled with the angle penalty - that seems to be really adapted to 
our biological problem, namely characterizing the dependence between the occurrences of a 
biological signal. Moreover it allows us to estimate the right range of interaction. 

This work asks however for several future developments. First it is necessary to treat inter- 
action with another type of events (for instance promoter /genes) with the Islands strategy. 
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Next a test procedure should be applied to know whether the function h is really non zero. 
This would be equivalent to testing whether there exists an interaction or not. 
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